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Abstract. In MCMC methods, such as the Metropolis-Hastings (MH) algorithm, the Gibbs 
sampler, or recent adaptive methods, many different strategies can be proposed, often asso- 
ciated in practice to unknown rates of convergence. In this paper we propose a simulation- 
based methodology to compare these rates of convergence, grounded on an entropy criterion 
computed from parallel (i.i.d.) simulated Markov chains coming from each candidate strat- 
egy. Our criterion determines on the very first iterations the best strategy among the 
candidates. Theoretically, we give for the MH algorithm general conditions under which 
its successive densities satisfy adequate smoothness and tail properties, so that this en- 
tropy criterion can be estimated consistently using kernel density estimate and Monte Carlo 
integration. Simulated examples are provided to illustrate this convergence criterion. 
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1 Introduction 

A Markov Chain Monte Carlo (MCMC) method generates an ergodic Markov chain 
for which the stationary distribution is a given probability density function 
(pdf) / over a state space !]Ci s . In situations where direct simulation from / is 
not tractable, or where integrals like Ey[/i] = J h(x)f(x) dx are not available in closed 
form, MCMC method is appropriate since, for T large enough, x^ is approximately 
/ distributed, and Kf[h] can be approximated by ergodic averages from the chain. 
A major context is Bayesian inference, where / is a posterior distribution, usually 
known only up to a multiplicative normalization constant. 

The Metropolis-Hastings (MH) algorithm (Hastings |27j ) is one of the most pop- 
ular algorithm used in MCMC methods. Another commonly used MCMC methods 
is the Gibbs sampler (first introduced by Geman and Geman ^5]; see also Gelfand 
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and Smith |17j). An account of definitions and convergence properties of Gibbs and 
MH algorithms can be found, e.g., in Gilks et al. |21j . 

In this paper, the theoretical developments will be focused on the MH algorithm, 
since the generic form of its kernel allows for a general study, as indicated below. 
However, our proposed methodology can be applied empirically to any MCMC al- 
gorithm (e.g., to the Gibbs sampler). It can also be applied to compare the various 
recent adaptive methods, which is an area of current and growing research in MCMC. 

For the MH algorithm, the "target" pdf / needs to be known only up to a 
(normalizing) multiplicative constant. Each step is based on the generation of the 
proposed next move y from a general conditional density q(y\x), called the instru- 
mental distribution or proposal density (hence a practical requirement is that simu- 
lations from q should be done easily). For a starting value x^ ~ p°, the n-th step 
x (n) _^ x (n+l) Q £ algorithm is as follows: 



1. generate y ~ q(-\x^) 

♦ i in) n • L f(y)Q(x (n) \y) 

2. compute a(x ( ' ,y) = mm < 1, 



f(x^y)q(y\x( n )) 



3. take x 



(n+l) 



y with probability a(x^ n \y), 
x( n ) with probability 1 — a{x^ n \y). 



Two well-known MH strategies are (i) the (so-called) Independence Sampler (IS), 
i.e. the MH algorithm with proposal distribution q{y\x) = q(y) independent of the 
current position, and (ii) the Random Walk MH algorithm (RWMH), for which the 
proposal is a random perturbation u of the current position, y = x^ +u. The usual 
choice for the latter is a gaussian perturbation with a fixed variance matrix (e.g., 
in the one-dimensional case, q(y\x) is the pdf of J\f(x,a 2 ) where a 2 is the scaling 
parameter of the perturbation, that has to be tuned). 

Ergodicity and convergence properties of the MH algorithm have been intensively 
studied in the literature, and conditions have been given for its geometric conver- 
gence (see, e.g., Mengersen and Tweedie [SHI, or Roberts and Tweedie [H2)- In 
particular, Mengersen and Tweedie proved geometric convergence in total variation 
norm of the IS, under the condition q(y) > af{y) for some a > 0. The associ- 
ated geometric rate is (1 — a) n , not surprisingly pointing out the link between the 
convergence rate and the proximity of q to /. 

To actually implement the MH algorithm, a virtually unlimited number of choices 
for the instrumental distribution can be made, with the goal of improving mixing 
and convergence properties of the resulting Markov chain. If one wants to use the IS 
strategy, selection of a reasonably "good" proposal density can be done using sev- 
eral procedures, among which: numerical analysis or a priori knowledge about the 
target to approximate the shape of / (modes,. . . ); preliminary MCMC experiment, 
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or adaptive methods to dynamically build a proposal density on the basis of the 
chain(s) history (Gelfand and Sahu ^S], Gilks et dl. [20], Chauveau and Van- 
dekerkhove 0, Haario et al. 26 J). If one wants to use the RWMH strategy, "good" 
scaling constants must be found, since the mixing depends dramatically on the vari- 
ance matrix of the perturbation (see, e.g., Roberts and Rosenthal [HE])- However, 
these various choices are associated in general to unknown rates of convergence, be- 
cause of the complexity of the kernel, and of the associated theoretical computations 
of bounds. 

The Gibbs sampler (Geman and Geman ^!5]) is defined in a multidimensional 
setup (s > 1). It consists in simulating a Markov chain = (x^ , . . . , Xg) by 
simulating each (not necessarily scalar) coordinate according to a decomposition of 
/ in a set of its full conditional distributions. In the case of a decomposition in s 
scalar coordinates, the nth step of the Gibbs sampler is: 



Their exists formally many possible decomposition of / in a set of full conditionals, 
each of which resulting in a different Gibbs sampler. In addition, data augmentation 
schemes (see Tanner and Wong 38 J may be used to sample from /, which gives 
even more possibilities, resulting here also in several simulation strategies that lead 
to generally unknown rates of convergence. Hence our entropy criterion may be 
used also in this setup to compare different Gibbs samplers, or to compare Gibbs 
samplers against MH algorithms or other strategies for the same target /. 

The motivation of this paper is thus to propose a method to compare the rates 
of convergence of several candidate simulation algorithms (designed for the same 
target /), solely on the basis of the simulated output from each Markov chain. Note 
that the question of selecting the best MH strategy among a family of proposal den- 
sities is the subject of recent developments (see, e.g., Gasemyr jXS] for a heuristical 
solution using adaptation, Mira :32_ for an ordering of MCMC algorithms based on 
their asymptotic precisions, or Rigat [35]). 

We suggest the use of an entropy criterion between the pdf of each algorithm at 
time n and the target density /. The computation of an estimate of this criterion 
requires the simulation, for a short duration no, of N parallel (i.i.d.) chains coming 
from each strategy. This can be seen as a pre-run, to determine the best algorithm 
before running it for the long duration required by the MCMC methodology. 

More precisely, assume we have two simulation strategies, generating two MCMC 
algorithms with densities denoted by p" and p?> a t time (iteration) n. For the 
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comparison, both algorithms are started with the same initial distribution, i.e. p\ = 
p\. Define the relative entropy of a probability density p by 

%(p) = J p(x) log p(x) dx . (1) 

A natural measure of the algorithm's quality is the evolution in time (n) of the 
Kullback-Leibler "divergence" between pf, i = 1,2, and /, given by 

K(pf,f) = j log (^\ P f(x)dx = H(p2)-E p n[logf]. 

The behavior of the application n — ► IC(p2, f) will be detailed in sectional 

When / is analytically known, an a.s. consistent estimation of E p »[log/] is ob- 
tained easily by Monte-Carlo integration using N i.i.d. realisations from the algo- 
rithm at time n. Unfortunately in the MCMC setup, / is usually the posterior den- 
sity of a Bayesian model, so that /(•) = Cip(-) where the normalization constant C 
cannot be computed, hence this direct estimation cannot be done. However, if we 
want to compare two strategies, knowing C is not needed to estimate the difference 
of the divergences with respect to /. Define 

D(p?,p n 2 ,f) = £(p?, /)-£(?&/) 

= H(K)-H(^)+E p j[log^]-E^[log^]. (2) 

The Kullback criterion is the only divergence insuring this property, hence it mo- 
tivates our choice for applying it. One may think of other distances, such as L 1 
or L 2 , but estimating such distances requires regularity conditions similar to ours 
(see, e.g., Devroye |lflj). In addition, using other divergences would require an es- 
timation of C by other techniques, which is typically not feasible in actual MCMC 
situations. Note also that the Kullback divergence is currently used as a criterion 
in other simulation approaches (see Douc et al. 

We propose to use the N i.i.d. simulations from each of the two strategies at 
time n, 

(Xl' n ,...,X% n ) i.i.d.~p?,t = l,2. 

These simulations are first used to estimate E p ™ [log <p] via Monte-Carlo integration. 
Denote these estimates by 

i N 

p?(log^ = jj 5>g ^ Etfpogd, (3) 

3=1 

where the convergence comes from the strong law of large numbers. 

The remaining problem is then the estimation of the entropies 7i(pf), i = 1,2. 
One classical approach is to build a nonparametric kernel density estimate of pf, 
and to compute the Monte-Carlo integration of this estimate. Techniques based on 
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this approach have been suggested by Ahmad and Lin pQ, and studied by several 
authors under different assumptions (see, e.g., Ahmad and Lin [2], Eggermont and 
LaRiccia |15j . Mokkadem [HE])- An interesting point is that in our setup, we can 
"recycle" the simulations already used to compute the p"(log(/9)7v's. We denote in 
the sequel our entropy estimate of H(p™) by TiNipf), which will be defined and 
studied in Sectional We define accordingly 

D N (p?,p%J) = HnQ?) - H N {p n 2 ) +p n 2 {Wv)N-p n 1 (Wv)N- 

Our methodology can be applied in actual situations in the following way: assume 
we have k possible simulation strategies si, . . . , Sf. to sample from /, resulting in k 
successive densities p™, i = 1, . . . , k, n > 0. Let p® = p° be the common initial 
distribution for the k algorithms. The determination of the best algorithm among 
the k candidates can be done using the steps below: 

1. select the best strategy Sb between s\ and s 2 , on the basis of the sign of (the 
plot of) n I— ► Dn{Pi ,p 2 , /)) fo r n = lj...,no, where no is the simulation 
duration; 

2. store the sequence of estimates {Ti-N(Pb) ~ Pb(^°Sf)N}i<n<n '-, 

3. for i = 3, . . . , k, 

(a) select the best strategy between Sb and Sj, as in step 1. Notice that 
the computation of Djy^p^p™, f) just requires now that of {Wjv(pf) — 

l<n<no i 

(b) update b and the sequence {TCiy(p^) — p^(log 9?)jv}i<n<n - 

In practice, no can be chosen small, since the best strategy is usually determined 
during the very first iterations. For the one or two dimensional examples simulated 
in Sectional we have observed that values of no between 10 and 30 is often sufficient. 
The point is that the difference between the entropy contraction rate of each strategy 
is obvious at the very first iterations. 

Storing at each step the sequence of estimates of the best strategy {Ti-N^Pb) ~ 
Pb (log (p)N}i<n<n clearly saves computing time; the total number of simulations 
required is thus N x k x uq. Concerning the computer investment, a C program for 
doing the parallel (i.i.d.) simulations together with entropy estimation for a generic 
MH algorithm is available (from the first author) as a starting point. 

As stated previously, the technical part of the paper focus on the MH algorithm. 
Section |21 outlines some links between ergodicity and convergence to zero of fC(p n , f) 
as n — ► oo. In Section we establish assumptions on the proposal density q, f and 
the initial density p° to insure that, at each time n, adequate smoothness conditions 
hold for the successive densities p n , n > 0. These conditions are stated for the 
general (multi-dimensional) case, and detailed precisely in the Appendix for the 
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one-dimensional situation, for which practical examples are given. In Section EI 
these conditions are used to define an estimate of 7i{p n ) on the basis of the i.i.d. 
simulation output. Finally, Section 03 illustrates the behavior of our methodology 
for synthetic one and two-dimensional examples. 

We provide in Section0]theoretical conditions under which our criterion is proved 
to converge, and check in the appendix that these conditions are satisfied in some 
classical simple situations, to show that it can reasonably be expected to be a good 
empirical indicator in general situations for which the technical conditions are hard 
to verify. However, it is important to insist on the fact that, from the methodological 
point of view, our comparison criterion may be applied to far more general MCMC 
situations than the MH algorithm. For example, the homogeneous Markov property 
of the simulated processes does not play any role in the convergence of the entropy 
estimates of Ti.(p™), since these estimates are based on i.i.d. copies at time n. Hence 
our methodology may be applied to compare the different adaptive sampling schemes 
proposed in recent literature (see, e.g., Haario et al. |2E], Atchade and Rosenthal [I], 
Pasarica and Gelman |34| ) . Indeed, we empirically used a preliminary version of 
this criterion to evaluate the adaptive MCMC method proposed in Chauveau and 
Vandekerkhove [7j, and we apply it successfully in Section|S]to another adaptive MH 
algorithm. 



2 Kullback divergence to stationarity 

In this section we show a property of the evolution in time of the Kullback-Leibler 
divergence between the distibutions p n of the MH algorithm and the target dis- 
tribution /. It has been proved (see, e.g., Miclo [5T]) that for countable discrete 
Markov chains, the Kullback-Leibler divergence between the measure at time n and 
the stationary measure with density / (also denoted /) decreases with time, i.e. 
that fC{mP, f) < IC(m, f), where mP is the transportation of a measure m with the 
Markov kernel P, defined by mP{-) = J m(dx)P(x, •). 

We denote in the sequel the supremum norm of a real-valued function ip by 

||<p||oo := sup |<p(x)|. (4) 

We first recall a result due to Holden |28| assessing the geometric convergence of the 
MH algorithm under a uniform minoration condition: 

If there exists a G (0, 1) such that q(y\x) > af(y) for all x,y £ Cl, then: 

(5) 

oo 

We use this result to show that the Kullback-Leibler divergence between p n and 
/ decreases geometrically fast in this case: 



Vy G ft, 



p n {y) 



f(y) 



<(i 



/ 
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Proposition 1 If the proposal density of the Metropolis-Hastings algorithm satifies 
q(y\%) > a f{y)i f or a ll x i y & an d a £ (0, 1), then 



K,(p n J)<K P n (l + K P n ) 1 

where k = \\p° / f — l||oo > 0, and p = (1 — a). 



(6) 



Proof. Using equation El we have: 

ic(p n ,f) = 

p n (y) 



< 



log 



i 



+ 1 



p n (y) 



i 



+ l)f(y)dy 



< \0g(n P n + l)(K P n + 1) < K P n (K P n + 1). 



□ 



More generally, the question of the convergence in entropy of a Markov process 
is an active field of current research; see, e.g., Ball et al. 0, Del Moral et al. and 
Chauveau and Vandekerkhove jHj. 



3 Smoothness of MCMC algorithms densities 

For estimating the entropy of a MCMC algorithm successive densities p n , n > 0, 
we have to check that appropriate smoothness and tails technical conditions on 
these successive densities hold. In our setup, it appears tractable to apply results 
on entropy estimation based on a Lipschitz condition. Remember that a function 
cp : Q — > R is called c-Lipschitz if there exists a constant c > such that, for any 
y,z e O, \cp(y) - <p(z)\ < c\\y - z\\. 

As stated in the introduction, we will essentially focus on the MH case, essentially 
because its kernel is "generic", depending only on q and /. However, there is a major 
difficulty in this case, coming from the fact that the MH kernel has a point mass at 
the current position. 

The difficulty for the Gibbs sampler is that its successive densities are given by 

P n+1 (y) = / P n (x)g(x,y)dx, 



i) x • • • x fs(ys\Vl, ■ ■ -,y s -i) (7) 



where g is the density of the Gibbs kernel, 

g(x,y) = fi(yi\x2, ■ •• ,x s ) x 72(2/2 1 2/1,^3, 



and Lipschitz condition on p n depends heavily of the decomposition of /. We in- 
deed obtained a Lipschitz conditionfor the first iterations in the case of a toy-size 
Gibbs sampler (s=2), but stating conditions at a reasonably general level seems not 
possible. 
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3.1 The MH Independence Sampler case 

From the description of the MH algorithm in Section^ we define the off-diagonal 
transition density of the MH kernel at step n by: 

and set the probability of staying at x, 

r(x) = 1 — J p(x,y)dy. 

The MH kernel can be written as: 

P(x,dy) =p(x,y)dy + r(x)6 x (dy), (9) 

where 8 X denotes the point mass at x. 

We focus first on the IS case (q(y\x) = q(y)) since it allows for simpler conditions. 
We will see that the minorization condition q(x) > af(y) which implies geometric 
convergence of the IS is also needed for our regularity conditions. One may argue 
that, in this case, it is also possible to use an importance sampling scheme (see, e.g., 
Douc et al. [IB])- This strategy garanties i.i.d. simulated values for /, but requires 
the normalization of the estimate (since the normalization constant C is unknown) , 
which may lead to large variance. 

Let p° be the density of the initial distribution of the MH algorithm, which will 
be assumed to be "sufficiently smooth", in a sense that will be stated later. We will 
assume also that the proposal density q and the target p.d.f. / are also sufficiently 
smooth. 

From 0, the successive densities of the IS are given by the recursive formula 
P n+1 (y) = q(y) P n (x)a(x,y)dx+p n (y) q(x)(l-a(y,x))dx (10) 



= q(y)l n (y)+P n (y)(l-I(y)), (11) 

where 

In(y) = J p n (x)a(x,y)dx, n>0 (12) 

I(y) = J q(x)a{y,x)dx. (13) 

For convenience, we introduce the notations 

a{x, y) = (f>{x, y) A 1, (f>{x, y) = ——, h(x) = -—. 

h{y) f{x) 
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We consider the first iteration of the algorithm. From the regularity prop- 
erties of the density p 1 are related to the regularity properties of the two parameter- 
dependent integrals I\ and /. Regularity properties of such integrals are classically 
handled by the theorem of continuity under the integral sign (see, e.g., Billingsley jH] 
Theorem 16.8 p. 212). Continuity is straightforward here: 

Lemma 1 If q and f are strictly positive and continuous on Q C M s ; and p° is 
continuous, then p n is continuous on £1 for n > 1. 

Proof. It suffices to prove continuity for p 1 (y) at any yo £ ^- The integrand of Iq, 
p°(x)a(x,y), is continuous in y at yo for any ig!!, and 

\p°(x)a(x,y)\ < p°(x), \/x,y££l. 

Then Iq is continuous at y by the Lebesgue's dominated convergence theorem (since 
Q is a metric space, so that continuity can be stated in term of limit of sequence). 
The same reasonning applies to I(y) by using q for the dominating function. □ 

From equation (|11|). we have directly that 

\p n+ \y)-p n+1 (z)\ < \\q\\oo\In(y)-In(z)\ + \\I n \\oo\q(y)-q(z)\ 
+ \\l-I\\ oc \p^y)-p^ Z )\ 

+ ||p n ||oo|/(y)-/(*)|, (14) 

so that, to prove recursively that p n+1 is Lipschitz, we have first to prove that I n 
and / are both Lipschitz. 

Lemma 2 If f jq is C\-Lipschitz, and f p°h < oo, then for all n > 1: 

(i) J p n h < oo; 
(ii) I n is (ci J p n h)- Lipschitz. 



Proof, first we have to check that J p°h < oo can be iterated. This comes directly 
from the recursive definition ()1(J|) (since < r(x) < 1): 



J p\y)Ky) 



dy 



p°(x)p(x, y) dx + p°(y)r(y) 



h(y) dy 



< 



p°(x)4>(x, y) dx 



m 

p°(y)Hy) d y < 00 • 
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Hence J p°h < oo f p n h < oo for n > 1. Then, we have 

\I n (y) - I n (z)\ < J p n (x)\a(x,y) - a(x,z)\dx 

< p n (x)\4>(x , y) — 4>(x , z)\ dx 




dx 



□ 



Note that the hypothesis that f/q is Lipschitz is reasonable in the IS context. 
Indeed, one has to choose a proposal density q with adequate tails for the MH to 
be efficient, i.e. to converge quickly. As recalled in the introduction, it has been 
proved that the IS is uniformly geometrically ergodic if q(y) > af(y) for some a > 
(Mengersen and Tweedie [50]). Actually, these authors also proved that the IS is 
not even geometrically ergodic if this condition is not satisfied. But satisfying this 
minoration condition requires q to have tails heavier than the tails of the target /. 
Hence, common choices for implementing the IS make use of heavy-tailed proposal 
densities (e.g., mixtures of multidimensional Student distributions with small degrees 
of freedom parameters), so that f/q is typically a continuous and positive function 
which goes to zero when ||a;|| — > oo. It can then reasonably be assumed to be 
Lipschitz. This condition in lemma [2] may thus be viewed as a consequence of the 
following assumption, which will be used below: 

Assumption A: q and f are strictly positive and continuous densities on £1, and q 
has heavier tails than f , so that ]imu y \\-+ 0O = +oo. 

We turn now to the second integral I(y) = J q(x)a(y,x) dx. The difficulty here 
comes from the fact that the integration variable is now the second argument of 
a(-, •). Hence, applying the majoration used previously gives 



and since we have made the "good" choice for the proposal density (assumption A), 
h = q/f is obviously not Lipschitz. 

A direct study of a(-,x) = [h{-)/h{x)\ A 1, as it appears in I(y) (equations (|TTj) 
and l|13j)) is needed here. Consider a fixed i G fi in the sequel. Clearly, there exists 
by (A) a compact set K(x) such that for any y ^ K(x), h(y) > h(x). This entails 
that 



Now, for any y G K(x), a(y,x) is a continuous function truncated at one, so that it 
is uniformly continuous. If we assume slightly more, i.e. that a(-,x) is c(x)-Lipschitz, 
we have proved the following Lemma: 



\Hy) ~ !( z )\ < / Q{xM(y,x) - <f>(z,x)\ dx 




Vy i K(x) 



a{y,x) = 1. 
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Lemma 3 If assumption A holds, and if for each x there exists c{x) < oo such that 
Vy,zeK(x), \a(y,x) - a{z,x)\ < c(x)\\y - z\\, (15) 

where c(x) satisfies 

J qi xHx)dx<c , (16) 
then I satisfies the Lipschitz condition: 

V(y,z)en 2 , \I(y)-I(z)\ <( f q(x)c(x)dx) \\y - z\\. 



Examples where lemmainiholds will be given in the Appendix, for the one-dimensional 
situation. 

Proposition 2 If conditions of Lemmas^\^ and\^hold, and if 

(i) IMloo = Q < °o and q is c q -Lipschitz; 
(ii) | |p° 1 1 oo = M < oo and p° is cq- Lipschitz; 

then the successive densities of the Independance Sampler satisfy a Lipschitz condi- 
tion, i.e. for any n > 0, there exists k{n) < oo such that 

V(y,z)en 2 , \p n (y)-p n (z)\ < k(n)\\y-z\\. (17) 



Proof. Using equation (|14[) . and the fact that 

||-fn||oo < Jp n (x)dx = l, \\I\\od < Jq(x)dx = l, 

and 

\\p n \\oo < Q||/n-l||oo + |b n_1 ||oo 111 -I(y)\\oo 

< nQ + M, 

we obtain 

\p n+ \y)-p n+ \z)\ < Q\l n (y)-l n {z)\ + \q(y)-q(z)\ 

+\p n (v) ~ P n {*)\ + {nQ + M)\I(y) - I{z)\. 
Thus, applying this recursively, (|T7j) is satisfied, with 

k(n) = Qc\ j p n (x)h(x) dx + c q 

+ ((n - l)Q + M) j q(x)c(x) dx + k(n - 1), n > 2 



□ 
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3.2 The general Metropolis-Hastings case 

When the proposal density is of the general form q(y\x) depending on the current 
position of the chain, the successive densities of the MH algorithm are given by 

P n+1 {y) = J p n (x)q(y\x)a(x,y)dx + p n {y) J q(x\y)(l - a(y,x))dx 

= J n (y)+P n (y)(i-J(y)), (18) 

where 

J n (y) = J p n (x)q(y\x)a(x,y)dx, (19) 

J{y) = I Q{x\y)a(y,x)dx. (20) 



In comparison with the IS case, the continuity already requires some additional 
conditions. Let B(yo,5) denotes the closed ball centered at yo £ Q,, with radius S. 

Lemma 4 If q(x\y) and f are strictly positive and continuous everywhere on both 
variables, and p° is continuous, and if: 

(i) swp X;y q(x\y) < Q < oo ; 

(ii) for any y £ O and some 5 > 0, sup^ eB ( y0;(5 ) q{x\y) < ip yo ,s(x), where ip y0t s is 
integrable; 

then p n is continuous on Vt for n > 1. 

Proof. As for Lemma it is enough to check the dominating conditions of, e.g., 
Billingsley ^6 , p. 212. However, for J, we need the local condition (ii) to prove the 
continuity of J{y) at any yo G Q. □ 

Note that the additional condition (ii) is reasonable. For instance, we refer to 
the most-used case of the RWMH with gaussian perturbation of scale parameter 
a 2 > 0. In the one-dimensional case, q{x\y) is the pdf of M(y, a 2 ) evaluated at x, 
and one can simply take for condition (ii) 

V yo ,s(x) = q(x\y - 5)I x<yo _$ + q(y - 5\y - 5)I [yo _ S:yo+s] (x) 

+q(x\yo + 6)I x>yo+s , (21) 

(i.e. the tail of the leftmost gaussian pdf on the left side, the tail of the rightmost 
gaussian pdf on the right side, and the value of the gaussian at the mode inside 

[yo -5, yo + £])• 

To prove that the successive densities p n of the general MH algorithm are Lip- 
schitz, we proceed using conditions at a higher level than for the IS case, because 
the successive densities are more complicated to handle 
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Proposition 3 If conditions of Lemma^hold, and if 

(i) | 1 1 oo = M < oo and p° is co-Lipschitz; 

(ii) q(-\x)a(x, ■) is c\{x)- Lipschitz, with J p n (x)c\{x) dx < oo, 
(Hi) J(-) is C2-Lipschitz, 

then the successive densities of the general MH satisfy a Lipschitz condition, i. e. for 
any n > 0, there exists i{n) < oo such that 

V(y,z)£n 2 , \p n (y)- P n (z)\ <£{n)\\y-z\\. (22) 

Proof. First, it is easy to check that, similarly to the IS case, ||J n ||oo < Q, 
1 1 J"| |oo < 1, and ||p n ||oo < n-Q + M. Then, using the decomposition 

\p n+1 (y)-P n+1 (z)\ < \j n ( y )-j n ( z )\ + 2\p n ( y )-p n (z)\ 

+\\p n \\oo\J{y)-J{z)\, 

equation 1)22(1 is clearly a direct consequence of conditions (ii) and (hi), and the 
£(n) 7 s can be determined recursively as in the proof of Proposition |2j □ 

PropositionOlmay look artificial since the conditions are clearly "what is needed" 
to insure the Lipschitz property for p n . However, we show in the Appendix (sec- 
tion E2J) that these conditions are reasonable, in the sense that they are satisfied, 
e.g., in the one-dimensional case for usual RWMH algorithms with gaussian proposal 
densities. 



4 Relative entropy estimation 



Let Xat = (Xi, . . . , Xn) be an i.i.d. iV-sample of random vectors taking values in W, 
s > 1, with common probability density function p. Suppose we want to estimate 
the relative entropy of p, Ti(p) given by (^Q), assuming that it is well defined and 
finite. Various estimators for TL{p) based on have been proposed and studied 
in the literature, mostly for the case s = 1. One method to estimate TC(p) consists 
in obtaining a suitable density estimate pjy for p, and then susbtituting p by pn 
in an entropy-like functional of p. This approach have been adopted by Dmitriev 
and Tarasenko HHIE], Ahmad and Lin [T]j21, Gyorfi and Van Der Meulen |23][!H]. 
and Mokkadem who prove strong consistency of their estimators in various 
framework. More recently Eggermont and LaRiccia ^S] prove, that they get the best 
asymptotic normality for the Ahmad and Lin's estimator for s = 1, this property 
being lost in higher dimension. Another method used to estimate TC(p) is based on 
considering the sum of logarithms of spacings of order statistics. This approach was 
considered by Tarasenko |39| . and Dudewicz and Van Der Meulen fTl]. 
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In our case, and due to the rather poor smoothness properties that can be proved 
for the densities p n we have to consider, we use the entropy estimate proposed by 
Gyorfi and Van Der Meulen |24| . but with smoothness conditions of Ivanov and 
Rozhkova [2H]: a Lipschitz condition which appeared tractable in our setup, as 
shown in Section 03 



Following Gyorfi and Van Der Meulen [23], we decompose the sample X^r into 
two subsamples Yjy = {Vj} and Z^r = {Zi}, defined by 

Yi = X 2i for i = 1, . . . , [N/2] (23) 
Z t = X 2i ^ fori = l,...,[(iV + l)/2], (24) 

where [N] denotes the largest integer inferior to N. 

Let pn{x) = pat(x,Zjv) be the Parzen-Rosenblatt kernel density estimate given 

by 

1 [(^+!)/2] f _ y V 

™M = kT(NTW g M^t)' <25) 

where the kernel K is a density and hjy > with liniTv^oo hw = 0, and lirn/v— >oo Nh% = 
oo. The entropy estimate TLn{p) = Hjv,Y,z(p) introduced by Gyorfi and Van Der 
Meulen [22], is then defined by: 

1 [N/2] 

Hn ^ = [N/2} ^ ^SPK^hpNiYi)^*} ( 26 ) 
where < a at < 1 and limjv^oo a at = 0. 

Theorem 1 Assume thatTC(f) < oo. For a// n > 0, let X^y 6e a N- sample from p n , 
the p.d.f. of the MH algorithm at time n, and consider the kernel density estimate 
P n n given in WW > based on the subsample Zjy defined in \2J$ . Let the kernel K be a 
bounded density, vanishing outside a sphere S r of radius r > 0, and set hjy = N~ a , 
< a < 1/s. Consider the entropy estimate 71m defined in \26\) with 

a N = {\ogN)- 1 . (27) 

Assume that there are positive constants C, ro, a, A and e, such that either: 

(i) in the case of the Independance Sampler: f , q and po satisfy conditions of 
Proposition^ q satisfies the minoration condition q{y) > af(y), and f satisfies 
the tail condition 

f(v)< || Usn C M , h2+e , for\\y\\>r ; (28) 

llyll ( lo g||y||) + 
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(ii) in the general MH case: f, q and po satisfy conditions of Proposition^- q is 
symmetric (q(x\y) = q(y\x)); < A, and f satisfies the tail condition 

l + ||y|| + 

Then, for all n > 0, H N (p n ) ^ H{p n ), as N -» oo. 

Proof. This result uses directly Gyorfi and Van Der Meulen's Theorem in |24| p. 
231. Conditions (JSEJ) or (J2SJ) and the fact that H(f) < oo implies, for all n > 0, the 
same conditions on the densities p n in either cases (i) or (ii). Actually, TC(f) < oo 
is a direct consequence of Q and of the positivity of /C. For the tail condition ((2 
case (i), it suffices to notice that from © we have for all i6!J: 

0< P n (x) < f(x) + K P n f(x) 
C(l + K P n ) 



< 



|x|| s (log ||x||) 2+e ' 



The tail condition for the general case (ii) comes directly from the recursive for- 
mula (|T8|) since 

P 1 (y) = My) +P°(y)( 1 - J(y)) < J P°(x)q(y\x)a(x,y)dx +p°(y) 



< I p {x)q{y\x)—— dx +p [) {y) 
J fix) 

< Af(y) J q(x\y)dx+p°(y) < 2Af(y). 
Applying this recursively gives 

2 n AC 



p n {y) < 2 n Af(y) < 



i + l y 



I S+E ' 



which is stricter than Gyorfi and Van Der Meulen's tail condition. As to smoothness, 
the conditions of our Proposition [3 for case (i), and Proposition for case (ii) give 
the Lipschitz condition of Ivanov and Rozhkova 29 for p n , which in turn is stricter 
than Gyorfi and Van Der Meulen's smoothness condition, as stated in Gyorfi and 
Van Der Meulen [Mj. □ 



5 Examples 



We give in this section several examples for synthetic models, with target densi- 
ties which are one and two-dimensional mixtures of gaussian distributions. The 
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advantage of taking a mixture is that it is an easy way to build multimodal target 
densities with "almost disconnected" modes, i.e. separated modes with regions of 
low probability in between (see figures ^ and |HJ) . 

The difficulty for classical RWMH algorithm is then to properly calibrate the 
variance of the random walk to propose jumps under all the modes in a reasonable 
amount of time. The difficulty for the IS is to use a good proposal density q, 
hopefully allowing sufficient mass over the modal regions. 

It is important to understand that in all the examples below, the target density 
is completely known so that, instead of estimating the difference /C(p™, /) — K(p2, f) 
between any two given strategies, we are able to estimate directly IC(pf, f) for each 
strategy i leading to the successive densities pf separately. Actually, we can compute 
the strongly consistent estimate 

1 N 

Kn(p?, f) = H N (p?) --5>g /(Xf), 

3=1 

where the X*' n 's, j = 1, . . . , N are i.i.d. ~ pf. 

We give the results in terms of these estimates, since they provide easier com- 
parisons and illustrate more clearly the behaviour of our method. However, one 
should keep in mind that in real-size situations, only the plots of the differences are 
accessible to computation. This is not a flaw in the method since clearly, the better 
algorithm can be deducted from these plots. For complete illustration, however, we 
have also provided for the first example the plots of n i— ► D(p™,p2, /) for comparing 
three strategies. The only information not provided by the plot of the difference is 
the "convergence time" of each chain (in the sense of the convergence assessment of 
MCMC, see, e.g., Gilks et al. Indeed, even if the difference goes about zero 

at time n, there is always a possibility that both MH algorithms fail to converge at 
that time, with K(pf, f) « /C(pJ, /). 

Since the models are quite simple here, we could ran a large number of i.i.d. 
Markov chains to obtain precise estimates. So we tried up to N = 1000 chains 
for the one-dimensional model, and up to N = 200 chains for the two-dimensional 
model. This number of parallel chains can be reduced without compromising the 
decision for real-size applications (we tried N = 100 chains with satisfactory results). 
Note also that the computing time needed, even with large N, is not long since the 
duration no of the parallel simulation is itself short: the best algorithm is quickly 
detected, as shown in the figures below. Finally, for computing the estimate of the 
entropy (|2fjjl. we use a treshold otv = 0(log(iV) _1 ) instead of (|27|). to avoid rejection 
of too many observations for small values of N. 
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5.1 A one-dimensional example 

To illustrate the relevance of our comparison of the convergence rates, we first choose 
a very simple but meaningful situation, consisting in MH algorithms for simulation 
from a mixture of 3 gaussian distributions, with density 

3 

f( x ) = ^2 a i <Pfa tH, of), (30) 
i=i 

where (p(-)fj,, a 2 ) is the pdf of M(fj,, a 2 ). The chosen parameters a± = 0.5, ai = 0.3, 
[Ml = 0, [i2 = 9, //3 = —6, a\ = 2, and a\ = a\ = 1, result in the trimodal pdf 
depicted in figure ^ 



o . be 

O ./0 4 ; 

\oi 02 



Figure 1: True mixture density J23). 



Independence Sampler We first ran the independence sampler with a gaussian 
proposal density q = AA(0, a 2 ), for several settings of the variance parameter. None of 
these MH are optimal, since q does not have modes at —6 and 9, whereas / has. If the 
variance is too small (e.g., a < 1), the algorithm can "almost never" propose jumps 
outside, say, [—4; +4], so that it can (almost) never visit the right or left modes. 
The algorithm requires then a dramatically long time to converge. Our Kullback 
divergence estimate reflect this fact (figure |U left). For more adapted settings like, 
e.g., a = 3, the algorithm converges faster, and the convergence again deteriorates 
as a increases above, say, 10, since the proposal density is then overdispersed (see 
figure El right). 

For this example, we also provide in figureOltwo examples of the plots available in 
actual situations, i.e. that of n i— > D(p™,p%, f). The sign of the plots in both cases, 
and for the first iterations, clearly indicate that, in both cases, the first strategy 
(qi = AA(0,3 2 )) is preferable. Morevover, the comparison of the two plots indicate 
that a = 100 is even worse than a = 30. 

To check our estimates with heavy-tailed proposal densities, we also ran the in- 
dependence sampler with a Student proposal density t(d), for d = 1 (the Cauchy dis- 
tribution), up to d = 100 (for which the Student is almost the normal distribution). 
As expected, the algorithms converge faster when they use Student distributions 
with heavier tails, since in that case they can still propose jumps in the left or right 
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Figure 2: Plots ofn^ K.n{p 71 , f) for the IS with a gaussian proposal 7V(0, a 2 ). Left: a = 3 
(solid) vs. (7 = 1 (dashed); Right: a — 3 (solid) vs. a — 10 (dashed), a = 30 (dotted), 
a = 100 (Jong dashed). 

mode. When ci — > oo, the proposal converges to the M(0, 1), and the IS shows the 
same behavior as the previous one, with a = 1 (compare figure [2 left with figure HJ 
right). 




Figure 3: Plots of the difference n i— > Z?(p",p'!, 1 , /) for the IS with a gaussian proposal 
q t = 7V(0, a 2 ). Left: a t = 3 vs. ct 2 = 30; Right: a x = 3 vs. ct 2 = 100. 
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Figure 4: Plots of n i— ► K-N{p n ,f) for the IS with a Student proposal t(d): Left: d = 1 
(solid), d = 2 (short dashed), d = 3 (dotted), d = 10 (Jong dashed). Right: d = 3 (solid), 
d = 20 (short dashed), d = 50 (dotted), d = 100 (Jong dashed). 

RWMH We also ran on the same example a random walk MH algorithm with a 
gaussian proposal = AA(x, o -2 ), and several settings for ct 2 . As expected in view 

of the region of interest for the target /, a good choice is about a = 10. For too small 
settings (e.g., a = 0.1), the jumps of the random walk (of order ±3o~) are too small, 
so that the chain needs a dramatically long time to reach the rightmost mode. This 
is clearly indicated by our estimate in figure El right, where up to n = 600 iterations 
are needed for this inefficient algorithm to converge. 
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Figure 5: Plots of n i— > Kn (p n , /) for the RWMH with gaussian proposal q(-\x) = Af(x, a 2 ) : 
left: a = 1 (dashed) vs. er = 10 (solid); right: o = 0.1 (dashed) vs. a = 10 (solid). 

5.2 A two-dimensional example 

We also tried for the target density a two-dimensional gaussian mixture, depicted 
in figure EJ left (the true parameters are not given for brevity). For this example, 
we compare three "good" strategies of different types: (i) An IS with a uniform 
proposal density over the compact [— 20;20] 2 ; this algorithm is "almost geometric" 
since the mass of the tails of / outside the compact are negligible (the minoration 
condition q > af is fulfilled on the compact), (ii) A RWMH with a bivariate gaussian 
proposal q(-\x) = M(x,a 2 I), with a "good" setting a = 17, founded using our 
Kullback divergence method, (iii) An adaptive MH algorithm following the ideas 
in Chauveau and Vandekerkhove |7j. In short, parallel chains started with the IS 
(i) are ran and kernel density estimates are built at some specified times using the 
past of these i.i.d. chains. For example, the proposal density built at time n = 48 is 
depicted in figure El right. 

The estimates of lC(p n ,f) for this setup (and N = 200 chains) are given in 
figure [Tj As expected, the IS with the uniform proposal density performs better 
than the calibrated RWMH. But the adaptive proposal is even better than the two 
others. The times at which the adaptive proposal density is updated (n = 5 and 9), 
are even visible in the plot of lC(p n , /) for this strategy, which means that it has an 
immediate effect on this divergence. 




Figure 6: left: true target pdf; right: adaptive proposal density. 
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Figure 7: Plots ofn i— > K,N(p n , /) based onN = 200 chains for the IS with adaptive proposal 
(dashed) vs. left: IS with q = Uniform distribution; right: RWMH with q(-\x) =N(x 1 cr 2 I), 
a = 17. 



6 Conclusion 

We have proposed a methodology to precisely quantify and compare several MCMC 
simulation strategies only on the basis of the simulations output. A procedure for 
applying our method in practice has been given in Section^ 

A novelty is that this methodology is based upon the use of the relative entropy 
of the successive densities of the MCMC algorithm. A consistent estimate of this 
entropy in the MH setup has been proposed, and general conditions insuring its 
convergence have been detailed. Indeed, the conditions of propositions [2] and [3 arc 
difficult to verify in practice. However, the theoretical study of Section |3] has been 
developed to support the idea that, if the "ingredients" of the MH algorithm (q, 
p° and /) have sufficient tails and smoothness conditions, then one can reasonably 
expect that the successive densities p n , n > 1, of the MH algorithm will also satisfy 
these conditions, so that our usage of the estimates of the H(p")'s will indicate the 
most efficient MH algorithm to use. 

Our methodology is an alternative to the adaptive MCMC strategies, which 
represent an active field of the current literature on this field. The advantage of 
our approach is that it avoids the difficulties associated to adaptation, namely the 
preservation of the convergence to the desired limiting distribution, and the theo- 
retical guarantee that the adaptive algorithm will perform better than any other 
heuristical approach. 

Most importantly, our method can also be used as a global criterion to compare, 
on specific cases, the incoming new (eventually adaptive) MCMC strategies against 
existing simulation methods. Note in addition that, if the comparisons are done 
on simulated situations where / is entirely known, our approach gives directly the 
estimate of lC(p n , f) for each strategy instead of the difference between two methods. 
We used for this purpose in example 15.21 and the need for such a global comparison 
criterion on simulated situations is apparent in recent literature, as e.g. in Haario 
et. al |23] and |2B|. 
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7 Appendix 

The purpose of this appendix is to show that some of the conditions required in 
Proposition [21 and Proposition |31 which look difficult to check in actual situations, 
are satisfied at least in simple situations, for classically used MH algorithms in the 
one-dimensional case i.e. when i£i 



7.1 The one-dimensional independence sampler case 

In the IS case, the difficult conditions are conditions Q15|) and ()16|) of Lemma |21 
These conditions are simpler to handle in the one-dimensional case. First, note that 
we can prove under additional conditions the derivability of p n for all n > 1 (that 
proof is not given since we are not using it here). In the one-dimensional case, when 
q and / are in addition derivable, and have non-oscillating tails, assumption (A) 
leads to 

3mi < rri2 : Vx < mi, h'(x) < 0, and Vx > mi, h'(x) > 0. (31) 
For a fixed x E R, there exists by (|3Tj) a compact set K(x) = [a(x), b(x)] such that 

(i) [mi,m 2 ] C K(x); 

(ii) h(a(x)) = h(b(x)) = h(x); 

(iii) for any y ^ K(x), h(y) > h{x). 



As in the general case, this entails that Vy ^ K(x), a(y,x) 
Lipschitz condition on K(x): 

Vy,z, \a{y,x) - a(z,x)\ < c(x)\y - z\ 

the expression of c(x) can be precised 

d(j)(y,x) 



c(x) 



sup 

y&K{x) 



Oy 



< oo. 



1. If we have the 



(32) 



and Lemma |21 holds if the integrability condition Q16JI is satisfied. Note that |a(x)| 
and b(x) both go to +oo as \x\ — * oo; in particular, b(x) = x for x > m^. Hence 
c(x) — > cxd as \x\ — > cxd, and condition ((To]) is not always true, but merely depends 
on the relative decreasing rate of the tails of q and /. 

For an illustrative example, assume that the tails of / are of order x~@ , and the 
tails of q are of order x~ a . Satisfying assumption A requires that (3 > a. Now, one 
can always use the fact that 



c(x) < sup 



d(p(y,x) 



dy 
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so that if (3 — 1 < a < (3, then c(x) is of order x a ~@ for large x and (|16|) is satisfied. 
The condition a £ [(3 — 1,(3] states that the tails of q should be "not too heavy", 
compared with the tails of /. This requirement is obviously stronger than what is 
needed, but more precise conditions require some analytical expression of c{x) for 
x ^ [7711,7712], and this expression depends on a(x) and h! . 

Fortunately, condition (|16ft is satisfied in much more general settings. For in- 
stance, consider situations where / and q are both symmetric w.r.t. 0, so that 
K(x) = [—\x\, \x\] for x outside [7711,7712], and c(x) can be expressed in closed form. 
Then it is easy to verify that (fR))) holds for, e.g., / = Af(0, 1) and q = t(d), the 
Student distribution with d degrees of freedom, for d >2 (even if, for d = 2 the tails 
of q are of order x~ 3 ). In this example, the proposal density has tails much more 
heavier than /, but Lemma holds i.e., / is still Lipschitz. 



7.2 The one-dimensional general MH case 

In the general MH case, the difficult conditions are conditions (ii) and (iii) of Propo- 
sition 01 Our aim is to show that these conditions hold in the simple RWMH case 
with gaussian proposal density. In order to obtain a tractable case, let q(y\x) be 
the p.d.f. of the gaussian N(x,l), and / be the density of the target distribution 
AA(0,1). 

For condition (ii) we have to prove that q(-\x)a(x, •) is c(x)-Lipschitz, with 
f p n (x)c(x) dx < 00. Here q{y\x) = q(x\y), so that 

fix) fix) 

which is a truncated function such that, for any x, lim^i^^ a(x, y) = 0. In other 
words, both a(x,y) and q{y\x)a{x, y) have tails behavior for large y. The non- 
troncated function f x (y) = q(y\x) f (y) / f (%) is then Lipschitz, with 

c{x) = sup \<f' x {y)\ ■ 

y£R 

A direct calculation (feasible in this simple case) gives c(x) oc exp(x 2 — 2)/4. Since to 
ensure the tails conditions of the successive densities p n we have to assume that the 
initial distribution itself has tails lighter or equal to that of / (i.e. that | |oo < A 
see Theorem ^) then by the recursive definition of p n we have, as in the proof of 
Theorem ^ p n {y) < 2 n A/(y), so that Jp n {x)c{x)dx < 00, i.e. condition (ii) of 
Proposition |3] holds. 

We turn now to condition (iii) of Proposition 03 i.e. we have to show that J(y) 
given by (|2l7|) is Lipschitz. For fixed y, z G R, 

\J{y) - J{z)\ < \ \q(x\y)a(y, x) - q(x\z)a(z, x)\ dx. 
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As for the IS case, we need a precise study of the truncated function here. We 
assume first that z > y > 0. Since q is symmetric, 

f(y) 

and we can define two compact sets K(y) and K(z) by 

K(t) = {x G R : a(t,x) = 1} = {x G R : /(x) > /(*)} 

which, in the present situation, are just K{y) = [— y,y], K(z) = [—z,z], and satisfy 
K(y) C K(z). Hence 



\J(y)- J ( z )\ ^ / \q(x\y) - q(x\z)\ dx 

JK{y) 

J(x) 



L 



K{z)\K{y) 



q{x\y) 



f(y) 



q{x\z) 



IK(z) c 



<i\x\y)-— ~ i( x \ z )- 



dx 
dx, 



where K(z) c 
written 



\ K(z). Using the mean value theorem, the first term can be 



/ \q(x\y) - q(x\z)\ dx < I \q{x\y) - q(x\z)\ dx 

JK(y) J 



< \y-z\J 



x-y 
2vr 



■exp(-(x-y*) 2 /2) dx 



< 



\y - z \ 



(33) 



where the last inequality comes from the absolute first moment of the normal density. 

For the second term, consider first the integral on the right side of K(z) \ K(y), 
that is Jy \ipy :Z (x)\ dx, where 

<PvA x ) = i(. x \y)j^j - ^(x\z). 

In this simple setting, it is easy to check that fy, z (-) ^ s a bounded function, mono- 
tonically decreasing from ip y ^ z (y) = 5 — q{y\z) > to tp y , z (z) = <l{y\ z ) — <5 < 0, where 
$ = Q{y\y) is the value of the gaussian density at its mode. Hence 



Jy 



q(x\y)j^ - q{x\z) 



dx < S\y — z\ 



(34) 



The symmetric term |(^y iZ (x)| dx is handled in a similar way. 
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The third term can in turn be decomposed into 

dx < Q 



K(zY 



\\AV)-FT\ ~ l( x \ z )- 



f(y) 



K(z) c 



fix) fix) 



f(y) /(«) 



dx 



+ I \q(x\y) - q(x\z)\ dx, 

lK(z) c 



where, as in Proposition 03 Q = H^Hoo, and since sup^g^^c \ f{x)/f{z)\ = 1. Using 
the mean value theorem as for the first term, 



k(x\y) - q(x\z)\ dx < \ -\y - z\. 
K( z y v n 



(35) 



Finally, 



K{z) c 



fix) f(x) 



dx 



fix) fix) 



< 2 



1 



i 



< 2V2^ze z2/2 



dx 

fix) dx 



+ a/z 2 + 4/vr 



\y - z \ 



< D\y-z\ 



(36) 
(37) 



where the left term in (|36() comes from the mean value theorem applied to the 
function l//(-), the rightmost term in (|36|) is a well-known bound of the tail of the 
normal distribution, and 



D = sup 

2 eK 



2V2^ze z2 ' 2 - 



< oo. 



Z + y/z 2 + 4/7T 

Collecting (jSJ), (JHSJ and JSZ)) together shows that 

|J(y) - J(z)| < A;|y - z| for z > y > and < fc < oo. 
The other cases are done similarly, so that J(-) is Lipschitz. 
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